Search Results: "Manoj Srivastava"

18 November 2007

Manoj Srivastava: Manoj: The movie vaguely resembling Beowulf: an IMAX 3d experience

This should really be titled "A movie vaguely representing Beowulf, but all sexed up with various salubrious elements". Hrothgar was treated much better in the original, and all the blatant and gratuitous sexuality brought into the movie was a turn-off. But then, I might be in the minority of the audience who had any familiarity with the poem. The characters in the movie seemed two-dimensional caricatures (the only compelling performance was from Grendel's mother), and the changes made to the story line also lost the prowling menace of the latter years of the king of the Geats. After watching Hollywood debacles like this one, I am driven to wonder why Hollywood writers seem to think they can so improve upon the work of writers whose story has stood the test of time. Making Beowulf into a boastful liar and cheat (even in the tale of the sea monsters his men imply that they knew their lord was a liar) in an age where honor and battle prowess were everything; I mean, what were the producers thinking? Most certainly not a movie I am going to recommend. I had not researched the movie much before I went into the show, and it was a surprise to me to see that this was an animated movie a la Final Fantasy, and while I was impressed with the computer graphics (reflections in general, and reflections of ripples in the water, were astounding), the not-a-cartoon-but-not-a-realistic-movie experience was a trifle distracting, and detracted from telling the tale. I like IMAX 3D, and the glasses are improving.

13 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion: III

Oath of Gold rounds out this excellent fantasy series from Elizabeth Moon. It is a pity that she never came back to this character (though she wrote a couple of prequels), despite the fact that the ending paragraph leaves ample room for sequels: when the call of Gird came, Paksenarrion left for other lands. This is high fantasy in the true Tolkien manner, but faster paced, grittier, and with characters one could relate to. I am already looking forward to my next re-read of the series.

12 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion: II

Divided Allegiance is the middle book of the trilogy, the one that I hate reading. Not because Ms Moon's book is bad, which it is not: it is still as gripping as the others, and comes closer to the high fantasy of Tolkien. It is just that I hate what happens to Paks in the book, and the fact that the book ends leaving her in that state. I guess I am a wimp when it comes to some things that happen to characters I identify with. However, it has been so long since I read the series that I have begun to forget the details, so I went through and read it anyway. This is a transition book: the Deeds of Paksenarrion was about Paksenarrion the line warrior, and the final book is where she becomes the stuff of legends. I usually read just the first and last.

10 November 2007

Manoj Srivastava: Manoj: The Secret Servant

I bought this book by Daniel Silva last week at SFO, faced with a long wait for the red-eye back home, since I recalled hearing about it on NPR and reading a review in Time magazine, or was it the New Yorker? Anyway, the review said he is "his generation's finest writer of international intrigue, one of America's most gifted spy novelists ever". I guess Graham Greene and John le Carre belong to an older generation. Anyway, everything I read or heard about it was very positive. Daniel Silva is far less cynical than le Carre, and his world does not gel quite as well, to my ears, as Smiley's circus did. The hero, Gabriel Allon, does have some superhuman traits, but, thank the lord, is not James Bond. I was impressed by Silva's geopolitics, though: paragraphs from the book seem to appear almost verbatim in current-event reports in the International Herald Tribune and BBC stories. I like this book (to the extent of ordering another seven from this author from Amazon today), and appreciate the influx of new blood in the international espionage market. Lately, the genre has been treated to lacklustre, mediocre knock-offs of the Bourne Identity, and the engaging pace of the original has never been successfully replicated in the sequels. And Silva's writing is better than Ludlum's.

Manoj Srivastava: Manoj: Adventures in the windy city

I have just come back from a half-week stay at the Hilton Indian Lakes resort (the second time in a month that I have stayed at a golf resort and club and proceeded to spend 9 hours a day in a windowless conference room). On Thursday night, an ex-Chicago native wanted to show us the traditional Chicago pizza (which can be delivered, half cooked and frozen, via Fed-Ex, anywhere in the lower 48). Google Maps to the rescue! One of the attendees had a car, and we piled in and drove to the nearest pizzeria. It was take-out only. We headed to the next on the list, again to be met with disappointment; making the pizza takes the best part of an hour, and we did not want to be standing out in a chilly parking lot while they made our pizza. So I strongly advocated going to Tapas Valencia instead, since I had never had tapas before. Somewhat to our disappointment, they served tapas only as an appetizer, and had a limited selection; so we ended up ordering one tapas dish (I had beef kabobs with a garlic horseradish sauce and caramelized onions), and my very first paella (paella valencia), with shrimp, mussels, clams, chicken, and veggies. We ate well, and headed back to the hotel. As we parked and started for the gate, I realized I no longer had my wallet with me, so back to the restaurant we went. The waiter had not found the wallet. Nor had the busboy. The owner/hostess suggested perhaps it was in the parking lot? So we all went and combed the parking lot, once, twice. At this point I am beginning to think about the consequences: I can't get home, because I can't get into the airport, since I have no ID. I have no money, but Judy can't wire money to me via Western Union because I have no ID. I need money to buy Greyhound tickets to get home on a bus, and then there is the cancelling of credit cards, etc. Panic city. While I was on my fourth circuit of the parking lot, the owner went back and checked the laundry chute. I had apparently carelessly draped the napkin over my wallet when paying the tab and walked away, and the busboy had just grabbed all the napkins, wallet and all, and dumped them down the chute. Judy suggests I carry an alternate form of ID and at least one credit card in a different location than my wallet for future trips. If that was not excitement enough, yesterday I got on the plane home, uneventfully enough. We took off, and I was dozing comfortably, when there were two loud bangs, and the plane juddered and listed to port. There was a smell of burning rubber, and we stopped gaining altitude. After making a rough about-turn with the left wing down, the pilot came on the intercom to say "We just lost our left engine, and we are returning to O'Hare. We should be in the ground in two minutes." Hearing the "in", a guy up front started hyperventilating, and his wife was rubbing his back. My feelings were mostly of exasperation: I had just managed to get myself situated comfortably, and now lord only knows when we would get another aircraft. When we landed, the nervous dude reached over and kissed his wife like he had just escaped the jaws of death. And he asked if any of us knew statistics, and if we were fine now. (I was tempted to state that statistics are not really predictive, but hey.) It was all very pretty, with six fire engines rushing over and spraying us with foam and all. When we got off the plane, the nervous dude headed straight to some chairs in the terminal, and said his legs would not carry him further. He did make it to the replacement plane later, though.
Turns out it was a bird flying into the engine that caused the flameout. Well, at least I have a story to tell, though it delayed getting home by about three hours.

8 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion

Sheepfarmer's Daughter is an old favourite, which I have read lord only knows how many times. Elizabeth Moon has written a gritty, enthralling story of the making of a Paladin. This is the first book of a trilogy, and introduces us to a new universe through the eyes of a young innocent (a great device: we see the world from the viewpoint of someone whose eyes are not yet jaundiced by experience). For me, books have always been an escape from the humdrum mundanity of everyday existence. Putting myself in the shoes of a character in the story is the whole point, and this story excels there: it is very believable. Not many people can tell a tale that comes alive, and Ms Moon is one of them. Moon is an ex-marine, and much of the detail of Paks' military life has been drawn from her own military experience. More than just that, the world is richly drawn, and interesting. I read this book in a hotel room in Chicago, since, as usual, there was nothing really interesting on TV, and I don't get the whole bar scene.

6 November 2007

Manoj Srivastava: Manoj: Continuous Automated Build and Integration Environment

One of the things I have been tasked to do in my current assignment is to create a dashboard of the status of the various software components created by the different contractors (participating companies) in the program. These software components are built by different development groups, using different toolsets, languages and tools, though I was able to get agreement on the VCS (Subversion, yuck). Specifically, one should be able to tell which components pass pre-build checks, compile, can be installed, and pass unit and functional tests. There should be nightly builds, as well as builds whenever someone checks in code on the release branches. And, of course, the dashboard should be HTTP accessible, and be bright and, of course, shiny. My requirements were that, since the whole project is not Java, there should be no dependencies on maven or ant or eclipse projects (or make, for that matter); that it should be able to do builds on multiple machines (license constraints restrict some software to Solaris or Windows); and that it not suck up too much time from my real job (this is a service: if it is working well, you get no credit; if it fails, you are on the hot seat). And it should be something I can easily debug, so no esoteric languages (APL, haskell and Python :P). So, using continuous integration as a google search term, I found the comparison matrix at Damage Control. I looked at anthill and cruisecontrol; the major drawback people seemed to think cruisecontrol had, namely that configuration is done by editing an XML file as opposed to a (by some accounts buggy) UI, is not much of a factor for me (see this). I also like the fact that it seems easier to plug in other components. I am uncomfortable with free software that has a commercial sibling; we have been burned once by UML software with those characteristics. Cruisecontrol, Damagecontrol, Tinderbox1 & Tinderbox2, Continuum, and Sin match my requirements. I tried to see the demo versions; Sin's link led me to a site selling Myrtle Beach condos, never a good sign. Continuum and Damagecontrol were down at the time, so I could not evaluate them. So, here are the ones I could get to with working demo pages: http://cclive.thoughtworks.com/ and http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey Cruisecontrol takes full control, checking things out of source control and running the tests, which implies that all the software builds and runs on the same machine; this is not the case for me. Also, CC needs to publish the results/logs in XML, which seems to be a good fit for the java world but might be a constraint for my use case. I like the tinderbox dashboard better, based on the information presented, but that is not a major issue. It also might be better suited for a distributed, open source development model; cruisecontrol seems slightly more centralized (more on this below). Cruisecontrol is certainly more mature, and we have some experience with it. Tinderbox has a client/server model, and communicates via email with a number of machines where the actual build/testing is done. This seems good. Then there is flamebox: a nice dashboard, a derivative of tinderbox2, and it seems pretty simple (perhaps too simple) and easily modifiable. However, none of these seemed right. There was too much of an assumption of a build-and-test model, and few of them seemed to be a good fit for distributed, Grid-based software development; so I continued looking. I finally decided on CABIE:
Continuous Automated Build and Integration Environment. Cabie is a multi-platform, multi-cm client/server based application providing both command line and web-based access to real time build monitoring and execution information. Cabie builds jobs based upon configuration information stored in MySQL and will support virtually any build that can be called from the command line. Cabie provides a centralized collection point for all builds providing web based dynamic access, the collector is SQL based and provides information for all projects under Cabie's control. Cabie can be integrated with bug tracking systems and test systems with some effort depending on the complexity of those systems. With the idea in mind that most companies create build systems from the ground up, Cabie was designed to not have to re-write scripted builds but instead to integrate existing build scripts into a smart collector. Cabie provides rapid email notification and RSS integration to quickly handle build issues. Cabie provides the ability to run builds in parallel, series, to poll jobs or to allow the use of scripted nightly builds. Cabie is perfect for agile development in an environment that requires multiple languages and tools. Cabie supports Perforce, Subversion and CVS. The use of a backend broker allows anyone with perl skills to write support for additional CM systems.
The nice people at Yo Linux have provided a Tutorial for the process. I did have to make some changes to get things working (mostly in line with the changes recommended in the tutorial, but not exactly the same). I have sent the patches upstream, but upstream is not sure how much of it they can use, since there has been major progress since the last release. Upstream is nice and responsive, and has added support in unreleased versions for using virtual machines to run the builds in (they use that to do the Solaris/Windows builds), improved the web interface using (shudder) PHP, and all kinds of neat stuff.

5 November 2007

Manoj Srivastava: Manoj: Filtering accuracy: Hard numbers

I have often posted on the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years, I stash all discards/rejects locally and do spot checks frequently, and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 a month (0.019%); incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classifications, my best estimate of not-confidently-classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%). I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email), about two-thirds of which are classified correctly but where either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there, and train my filters based on these; the process is highly automated (it just uses my brain as the classifier). The mail statistics can be seen on my mail server. Oh, my filtering front end also switches between reject/discard and turns grey listing on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; see mimedefang-filter. However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup's overall accuracy, but now I have some figures on one of the two classifiers in my system. Here is the data from CRM114. I'll update the numbers below via cron. First, some context: when training CRM114 using the mailtrainer command, one can specify to leave out a certain percentage of the training set in the learn phase, and run a second pass over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run (a sketch of how I pick the two digits appears after the table below). An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a Spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding Spamassassin into the mix only improves the numbers. Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains an unsure message and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle. The table below starts with data from a newly minted .css file.
Accuracy numbers and validation regexps
Date                          Corpus   Ham                        Spam                       Overall                    Validation
                              Size     Count Correct Accuracy     Count Correct Accuracy     Count Correct Accuracy     Regexps
Wed Oct 31 10:22:23 UTC 2007 43319 492 482 97.967480 374 374 100.000000 866 856 98.845270 [1][6][_][_] [0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007 43330 490 482 98.367350 378 378 100.000000 868 860 99.078340 [3][7][_][_] [2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007 43334 491 483 98.370670 375 375 100.000000 866 858 99.076210 [2][0][_][_] [7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007 43345 492 482 97.967480 376 376 100.000000 868 858 98.847930 [1][2][_][_] [0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007 43390 490 480 97.959180 379 379 100.000000 869 859 98.849250 [4][1][_][_] [6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007 43394 491 482 98.167010 375 375 100.000000 866 857 98.960740 [3][1][_][_] [7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007 43400 490 483 98.571430 377 377 100.000000 867 860 99.192620 [4][6][_][_] [6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007 43409 490 485 98.979590 377 377 100.000000 867 862 99.423300 [3][7][_][_] [7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007 43421 490 486 99.183670 379 379 100.000000 869 865 99.539700 [7][2][_][_] [9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007 43423 490 489 99.795920 378 378 100.000000 868 867 99.884790 [4][0][_][_] [8][3][_][_]
As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.
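
For the record, picking the two reserved suffixes is the only part of the accuracy run that is not entirely mechanical. The following is a minimal sketch of how it could be scripted, not my production cron job; in particular, the --validate option and the way the two digit patterns are combined into one expression are assumptions based on the mailtrainer documentation:

    # pick two random two-digit suffixes, reserving ~2% of the corpus
    # for the accuracy pass (RANDOM assumes a bash-like shell)
    d1=$(printf '%02d' $((RANDOM % 100)))
    d2=$(printf '%02d' $((RANDOM % 100)))
    regexp="[${d1%?}][${d1#?}][_][_]|[${d2%?}][${d2#?}][_][_]"
    /usr/share/crm114/mailtrainer.crm        \
        --spam=/backup/classify/Done/Spam/   \
        --good=/backup/classify/Done/Ham/    \
        --repeat=100 --streak=35000          \
        --validate="${regexp}"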

4 November 2007

Manoj Srivastava: Manoj: The White Company

I had somehow managed to miss out on The White Company while I was growing up and devouring all the Sherlock Holmes stories and The Lost World. This is a pity, since I would have liked this bit of the Hundred Years War much better when I was young and uncritical. Oh, I do like the book. The pacing is fast, if somewhat predictable. The book is well researched, leads you from one historic event to the other, is peppered with all kinds of historical figures, and I believe it to be quite authentic in its period settings. Unfortunately, there is very little character development, and though the characters are deftly sketched, they all lack depth, which would not have bothered the young me. Also, Sir John Hawkwood, of the White Company, is mentioned only briefly in passing. This compares less favourably with Walter Scott's Quentin Durward, set in a period less than 80 years later; but then, I've always had a weakness for Scott. As for Conan Doyle, The Lost World was far more gripping. I am now looking for books about Hawkwood, a mercenary captain mentioned in this book, as well as Dickson's Childe Cycle books. The only books I have found so far on the golden age of the Condottieri are so darned expensive.

28 September 2007

Martin F. Krafft: Counting developers

For my research I wanted to know how to obtain the exact number of Debian developers. Thanks to help from Andreas Barth and Manoj Srivastava, I can now document the procedure:
$ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
    gidNumber=800 keyFingerPrint \
  | sed -rne ':s;/^dn:/bl;n;bs;:l;n;/^keyFingerPrint:/p;bs' \
  | wc -l
1049
This actually seems right, as I do not recall any new maintainers being added since the last call for votes, which gives 1049 as well. Andreas told me to count the number of entries in LDAP with GID 800 and an associated key in the Debian keyring. Manoj's dvt-quorum script also takes the Debian keyrings (GPG and PGP) into account, so I did the same:
$ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
    gidNumber=800 keyFingerPrint \
  | sed -rne ':s;/^dn:/bl;n;bs;
              :l;n;/^keyFingerPrint:/s,keyFingerPrint: ,,p;bs' \
  | sort -u > ldapfprs
$ rsync -az --progress \
  keyring.debian.org::keyrings/keyrings/debian-keyring.gpg \
  ./debian-keyring.gpg
$ gpg --homedir . --no-default-keyring --keyring debian-keyring.gpg \
  --no-options --always-trust --no-permission-warning \
  --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
  --fast-list-mode --fixed-list-mode --with-colons --list-keys \
  | sed -rne 's,^fpr:::::::::([[:xdigit:]]+):,\1,p' \
  | sort -u > gpgfprs
$ rsync -az --progress \
  keyring.debian.org::keyrings/keyrings/debian-keyring.pgp \
  ./debian-keyring.pgp
$ gpg --homedir . --no-default-keyring --keyring debian-keyring.pgp \
  --no-options --always-trust --no-permission-warning \
  --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
  --fast-list-mode --fixed-list-mode --list-keys \
  | sed -rne 's,^[[:space:]]+Key fingerprint = ,,;T;s,[[:space:]]+,,gp' \
  | sort -u > pgpfprs
$ sort ldapfprs pgpfprs gpgfprs | uniq -c \
  | egrep -c '^[[:space:]]+2[[:space:]]'
1048
MAN OVERBOARD! Who's the black sheep? Update: In the initial post, I forgot the option --fixed-list-mode and hit a minor bug in gnupg. I have since updated the above commands. Thus, there is no more black sheep, and the rest of this post only lingers here for posterity.
while read i; do
  grep "^$i$" pgpfprs gpgfprs || echo "$i" >&2
done < ldapfprs >/dev/null
which returns 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5
$ gpg --list-keys 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5
pub   4096R/AB2A91F5 2004-08-20
uid                  James Troup <james@nocrew.org>
our very own keyring master James Troup. So has James subverted the project? Is he actually not a Debian developer? Given the position(s) he holds, does that mean that the project is doomed? Ha! I am so tempted to end right here, but since my readers are used to getting all the facts, here's the deal: James is so special that he gets to be the only one to have a key in our GPG keyring which can be used for encryption, or so I found out as I was researching this. Now this bug in gnupg actually causes his fingerprint not to be printed. Until this is fixed (if ever), simply leave out --fast-list-mode in the above commands. NP: Oceansize: Effloresce

21 August 2007

Manoj Srivastava: Arch Hook

All the version control systems I am familiar with run scripts on checkout and commit to take additional site-specific actions, and arch is no different. Well, actually, arch is perhaps different in the sense that arch runs a script on almost all actions, namely the ~/.arch-params/hook script. Enough information is passed in to make this mechanism one of the most flexible I have had the pleasure to work with. In my hook script, I do the following things: I'd be happy to hear about what other people add to their commit scripts, to see if I have missed out on anything.
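
For comparison purposes, here is a minimal skeleton of such a hook, not my actual script; I am assuming, per the tla documentation, that the operation name arrives as the first argument and that details such as ARCH_BRANCH and ARCH_REVISION arrive in the environment:

    #!/bin/sh
    # ~/.arch-params/hook (sketch): arch calls this on most operations.
    # Assumptions: $1 is the operation name; ARCH_* environment variables
    # (ARCH_CATEGORY, ARCH_BRANCH, ARCH_VERSION, ARCH_REVISION) carry the
    # details of what just happened.
    case "$1" in
        commit)
            # e.g. log it, send mail, or kick off a checkout elsewhere
            logger -t arch-hook "committed $ARCH_REVISION on $ARCH_BRANCH"
            ;;
        import|tag)
            logger -t arch-hook "$1 created $ARCH_VERSION"
            ;;
        *)
            : # ignore everything else
            ;;
    esac
    exit 0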

20 August 2007

Manoj Srivastava: Mail Filtering with CRM114: Part 4

Training the Discriminators. It has been a while since I posted in this category; actually, it has been a long while since my last blog post. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both. Setting up Spamassassin: this post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dump the Bayes information (don't they trust their engine?). I have spent some effort building a large corpus, and keeping it clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leeching away, I first increased the size of the database and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:
bayes_expiry_max_db_size  4000000
bayes_auto_expire         0

I also have regularly updated spam rules from the Spamassassin rules emporium to improve the efficiency of the rules; my current user_prefs is available as an example. Initial training: I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails. I have created a couple of scripts to train the discriminators from scratch using the extant corpus; these scripts are also used for re-learning, for instance when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discriminators. I generally run them from the ~/.spamassassin/ and ~/var/lib/crm114 (which contains my CRM114 setup) directories. I have found that training Spamassassin works best if you alternate Spam and Ham message chunks, and this Spamassassin learning script delivers chunks of 50 messages for learning. With CRM114, I have discovered that it is not a good idea to stop learning based on the number of times the corpus has been gone over, since stopping before all messages in the corpus are correctly handled is disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the job nicely; running it under screen is highly recommended. Routine updates: coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both. There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting (a sketch of the idea follows below). It processes a bunch of mail folders, which are supposed to contain mail which is either all ham or all spam, as indicated by the command line arguments. We go looking through every mail, and for any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out the mail-gathering headers, save the mail, one message to a file, and train the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus). The second script, called mproc, is a convenience front-end; it just calls mail-process with the proper command line arguments, feeding it the ham and junk folders in sequence, and takes no arguments itself. So, after human classification, just calling mproc does the rest. This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.
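
The actual mail-process script is longer and tied to my local setup, but the train-on-error core could be sketched as below. This is only an illustration of the idea: the use of formail to split the mbox, the exact X-Spam-Status:/X-CRM114-Status: header values tested, and the mailreaver --learnspam/--learnnonspam invocations are assumptions here, not verbatim from the script.

    #!/bin/sh
    # mail-process (sketch): train a filter only when it got a message wrong.
    # Usage: mail-process ham folder [folder ...]
    #        mail-process spam folder [folder ...]
    kind="$1"; shift
    export kind
    for folder in "$@"; do
        formail -s sh -c '
            msg=$(mktemp) && cat > "$msg"
            # (the real script also strips the delivery-time headers first)
            if [ "$kind" = ham ]; then
                # retrain whichever filter thought this ham was spam
                grep -q "^X-Spam-Status: Yes"    "$msg" && sa-learn --ham  "$msg"
                grep -q "^X-CRM114-Status: SPAM" "$msg" && \
                    crm mailreaver.crm --learnnonspam < "$msg"
                dest=/backup/classify/Done/Ham
            else
                grep -q "^X-Spam-Status: No"     "$msg" && sa-learn --spam "$msg"
                grep -q "^X-CRM114-Status: Good" "$msg" && \
                    crm mailreaver.crm --learnspam < "$msg"
                dest=/backup/classify/Done/Spam
            fi
            # stash the message, one file per mail, into the corpus
            cp "$msg" "$(mktemp "$dest/msg.XXXXXX")"
            rm -f "$msg"
        ' < "$folder"
    done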

14 January 2007

Manoj Srivastava: Mail Filtering with CRM114: Part 3

Uphold, maintain, sustain: life in the trenches. Now that I have a baseline filter, how do I continue to train it without putting in too much effort? There are two separate activities here: firstly, selecting the mails to be used in training, and secondly, automating the training and saving to the mail corpus. Ongoing training is essential; Spam mutates, and even ham changes over time, and well trained filters drift. However, if training disrupts the normal work-flow, it won't happen; so a minimally intrusive set of tools is critical. Selecting mail to train filters: there are three broad categories of mails that fit the criteria. 1. Misclassified mail: this is where human judgement comes in, to separate the wheat from the chaff. 2. Partially misclassified mail: mail correctly classified overall, but misclassified by either crm114 or spamassassin, though not both. This is an early warning sign, and is more common than mail that is misclassified outright, since usually the filter that is wrong is wrong weakly. But this is the point where training should occur, so that the filter does not drift to the point that mails are misclassified. Again, the mlg script (described below) catches this. 3. Mail that crm114 is unsure about: mail correctly classified, but which mailreaver is unsure about -- this category is why mailreaver learns faster than mailfilter. Spam handling and disposition (see the Spam handling schema): at this point I should say something about how I generally handle mails scored as Spam by the filters. As you can see from the schema, the mail handling is simple, depending on the combined score given to the mail by the filters. The handling rules are: So, any mail with score less than 15 is accepted, potentially after grey-listing. The disposition is done according to the following set of rules: In the last 18+ months, I have not seen a Ham mail in my realspam folder; the chances of Ham being rejected are pretty low. My Spam folder gets a ham message every few months, but these are usually spamassassin misclassifying things, and mlg detects those. I have not seen one of these in the last 6 months. So my realspam canary has done wonders for my peace of mind. With ideally trained filters, the spam and realspam folders would be empty. Mail list grey: I have created a script called mlg ("Mail List Grey") that I run periodically over my mail folders, which picks out mails that either (a) are classified differently by spamassassin and crm114, or (b) are marked as unsure by mailreaver. The script takes these mails and saves them into a grey.mbox folder. I tend to run it over Spam and non-Spam folders in different runs, so that the grey.mbox folder can be renamed to either ham or junk in the vast majority of cases. Only for misclassified mails do I have to individually pick out the misplaced email and classify it separately from the rest of the emails in that batch. At this point, I should have mail mbox folders called ham and/or junk, which are now ready to be used for training either crm114 or spamassassin or both. Processing these folders is the subject of the next article in this series.
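
mlg itself is just a bit of header matching; a stripped-down sketch of the selection logic follows. The header names, the mailreaver UNSURE marker, and the use of formail to split the mbox are assumptions for illustration, not the literal contents of my script.

    #!/bin/sh
    # mlg -- "mail list grey" (sketch): collect mails where the two
    # classifiers disagree, or where crm114/mailreaver is unsure, into
    # grey.mbox for human classification and later training.
    out=$HOME/grey.mbox
    export out
    for folder in "$@"; do
        formail -s sh -c '
            msg=$(mktemp) && cat > "$msg"
            sa=$(formail -c -x "X-Spam-Status:"   < "$msg" | cut -d, -f1 | tr -d " ")
            crm=$(formail -c -x "X-CRM114-Status:" < "$msg" | awk "{print \$1}")
            if [ "$crm" = UNSURE ] || \
               { [ "$sa" = Yes ] && [ "$crm" = Good ]; } || \
               { [ "$sa" = No  ] && [ "$crm" = SPAM ]; }; then
                cat "$msg" >> "$out"
            fi
            rm -f "$msg"
        ' < "$folder"
    done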

11 January 2007

Manoj Srivastava: Mail Filtering with CRM114: Part 2

Or, Cleanliness is next to godliness. The last time I blogged about Spam fighting (Mail Filtering With CRM114, Part 1), I left y'all with visions of non-converging learning, various ingenious ways of working around an unclean corpus, and perhaps a sinking feeling that this whole thing was more fragile than it ought to be. During this period of flailing around, trying to get mailtrainer to learn the full corpus correctly, I upgraded to an unpackaged version of crm114. Thanks to the excellent packaging effort by the maintainers, this was dead easy: get the Debian sources using apt-get source crm114, download the new tarball from crm114 upstream, cp the debian dir over, and just edit the changelog file to reflect the new version. I am currently running my own, statically linked 20061103-Blame Dalkey.src-1. Cleaning the corpus made a major difference to the quality of discrimination. As mentioned earlier, I examined every mail that was initially incorrectly classified during learning. Now, there are two ways this can happen: either the piece of mail was correctly placed in the corpus, but had a feature that was different from those learned before; or it was wrongly classified by me. When I started, the two were almost equally likely; I have now hopefully eliminated most of the misclassifications. When mailtrainer goes into cycles, retraining on a couple of emails round after round, you almost certainly are trying to train it in conflicting ways. Cyclic retraining is almost always a human's error in classification. Some of the errors discovered were not just misclassifications: some were things that were inappropriate mail, but not Spam; for instance, there was the whole conversation where someone subscribed debian-devel to another mailing list: there was the challenge, the subscription notice, the un-subscription, challenge, and notice -- all of which were inappropriate, and interrupted the flow, and contributed to the noise -- but were not really Spam. I had, in a fit of pique, labelled them as Spam; but they were really like any other mailing list subscription conversations, which I certainly want to see for my own subscriptions. crm114 did register the conversations as Spam and non-Spam, as requested, but that increased the similarity between Spam and non-Spam features -- and probably decreased the accuracy. I've since decided to train only on Spam, not on inappropriate mails, and let Gnus keep inappropriate mails from my eyes. I've also waffled over time about whether or not to treat newsletters from Dr. Dobb's Journal or USENIX as Spam -- now my rule of thumb is that since I signed up for them at some point, they are not Spam -- though I don't feel guilty about letting mailagent squirrel them away mostly out of sight. A few tips about using mailtrainer:

24 December 2006

Manoj Srivastava: Arch, Ikiwiki, blogging

One of the reasons I have only blogged 21 times in thirty months is the very kludgey work flow I had for blogging; I had to manually create the file, then scp it by hand, and ensure that any ancillary files were in place on the remote machine that serves up my blog. After moving to ikiwiki, and thus arch, there would be even more overhead, were it not so amenable to scripting. Since this is arch, and therefore creating branches and merging is both easy and natural, I have two sets of branches -- one set for the templates and actual blog content I serve on my local, development box, and a parallel set of branches that I publish. The devel branches are used by ikiwiki on my local box; the remote ikiwiki uses the publish branch. So I can make changes to my heart's content on the devel branch, and then merge into my publish branch. When I commit to the publish branches, the hook function ensures that there is a fresh checkout of the publish branch on the remote server, and that ikiwiki is run to regenerate the web pages to reflect the new commit. The hook functions are nice, but not quite enough to make blogging as effortless as it could be. With the move to ikiwiki, and the dissociation of classification and tagging from the file system layout, I have followed the lead of Roland Mas and organized my blog layout by date; posts are put in blog/$year/$month/$escaped_title. The directory hierarchy might not exist for a new year or month. A blog posting may also show up in two different archive indices: the annual archive index for the year, and a monthly index page created for every month I blog in. However, at the time of writing, there is no annual index for the next year (2007), or the next month (January 2007). These have to be created as required. All this would get quite tedious, and indeed would frequently remain undone, were it not for automation. To make my life easier, I have blogit!, which takes care of the niggling details. When called with the title of the prospective post, this script figures out the date, ensures that the blog directory structure exists, creating path components and adding them to the repository as required, creates a blog entry template, adds the blog entry to the repository, creates the annual or the monthly archive index and adds those to the repository as needed, and finally calls emacs on the blog posting file. Whew.
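
The real blogit! is tied to my tree layout and templates, but the gist can be sketched along these lines; the index and template contents, and the assumption that a plain tla add is enough to register new files and directories, are placeholders for illustration:

    #!/bin/sh
    # blogit (sketch): create the skeleton for a new post and open it.
    # Usage: blogit "Title of the prospective post"
    set -e
    title="$1"
    year=$(date +%Y); month=$(date +%m)
    escaped=$(printf '%s' "$title" | tr 'A-Z ' 'a-z_' | tr -cd 'a-z0-9_')
    dir=blog/$year/$month
    post=$dir/$escaped.mdwn
    # create any missing path components and register them with arch
    for d in blog blog/$year "$dir"; do
        [ -d "$d" ] || { mkdir -p "$d"; tla add "$d"; }
    done
    # annual and monthly archive indices, created only when missing
    for idx in blog/$year.mdwn $dir.mdwn; do
        [ -f "$idx" ] || {
            echo "[[inline pages=\"${idx%.mdwn}/*\"]]" > "$idx"
            tla add "$idx"
        }
    done
    # the post itself, from a minimal template
    printf '[[meta title="%s"]]\n\n' "$title" > "$post"
    tla add "$post"
    exec emacs "$post"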

23 December 2006

Manoj Srivastava: Mail Filtering With CRM114: Part 1

The first step is to configure the CRM114 files, and these are now in pretty good shape as shipped with Debian. All that I needed to do was set a password, say that I'd use openssl base64 -d, and stick with the less verbose defaults (so no munging subjects, no saving all mail, no saving rejects, etc., since I have other mechanisms that do all that). The comments in the configuration files are plenty good enough. This part went off like a breeze; the results can be found here (most of the values are still the upstream defaults). The next step was to create new, empty .css files. I have noticed that creating more buckets makes crm114 perform better, so I go with larger than normal crm114 .css files. I have no idea if this is the right thing to do, but I make like Nike and just do it. At some point I'll ask on the crm114-general mailing list.
    % cssutil -b -r -S 4194000 spam.css
    % cssutil -b -r -S 4194000 nonspam.css

Now we have a blank slate; at this time the filter knows nothing, and is equally likely to call something Spam or non-Spam. We are now ready to learn. So, I girded my loins, and set about feeding my whole mail corpus to the filter:
 /usr/share/crm114/mailtrainer.crm               \
   --spam=/backup/classify/Done/Spam/           \
   --good=/backup/classify/Done/Ham/            \
   --repeat=100 --streak=35000                  \
   | egrep -i '^ +(train|Excell|Running)'

And this failed spectacularly (see bug #399306). Faced with unexpected segment violations, and not being proficient in crm114's rather arcane syntax, I was forced to speculation: I assumed (as it turns out, incorrectly) that if you throw too many changes at the crm114 database, things rapidly escalate out of control. I went on to postulate that, since my mail corpus was gathered over a period of years, the characteristics of Spam had drifted over time, and what I consider Spam has also evolved. So, some parts of the early corpus are at variance with the more recent bits. Based on this assumption, I created a wrapper script which did what Clint has called training to exhaustion -- it iterated over the corpus several times, starting with a small and geometrically increasing chunk size. Given the premise I was working under, it does a good job of training crm114 on a localized window of Spam: it feeds chunks of the corpus to the trainer, with each successive chunk overlapping the previous and succeeding chunks, and ensures that crm114 is happy with any given chunk of the corpus. Then it doubles the chunk size, and goes at it again. All very clever, and all pretty wrong. I also created another script to retrain crm114, which was less exhaustive than the previous one, but did incorporate any further drift in the nature of Spam. I no longer use these scripts; but I would like to record them for posterity as an indication of how far one can take a hypothesis. What it did do was let crm114 learn without segfaulting -- and show me that there was a problem in the corpus. I noticed that in some cases the trainer would find a pair of mail messages, classify them wrongly, and retrain and refute -- iteration after iteration, back and forth. I noticed this once I had added the egrep filter above, and was no longer drowning in the needless chatter from the trainer. It turns out I had very similar emails (sometimes, even the same email) in both the Ham and the Spam corpus, and no wonder crm114 was throwing hissy fits. Having small chunks ensured that I had not too many such errors in any chunk, and crm114 did try to forget a mail classified differently in an older chunk and learn whatever this chunk was trying to teach it. The downside was that the count of the differences between Ham and Spam went down, and the similarities increased -- which meant that the filter was not as good at separating Ham and Spam as it could have been. So my much vaunted mail corpus was far from clean -- over the years, I had misclassified mails, been wishy-washy, and changed my mind about what was and was not Spam. I have a script that uses md5sums to find duplicate files in a directory (a sketch appears below), and found, to my horror, that there were scores of duplicates in the corpus. After eliminating outright duplicates, I started examining anything that showed up with an ER (error, refute) tag in the trainer output; on the second iteration of the training script these were likely to be misclassifications. I spent days examining my corpus and cleaning it out, and was gratified to see the ratio of differences to similarities between Ham and Spam css files climb from a shade under 3 to around 7.35. Next post we'll talk about lessons learned about training, and how a nominal work flow of training on errors and training when the classifiers disagree can be set up.
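
The duplicate finder is nothing fancy; something along these lines (a sketch, not the actual script) is enough to flag candidate duplicates in the corpus for manual inspection:

    # group corpus files that share an md5sum; each blank-line-separated
    # block in the output is a set of probable duplicates
    find /backup/classify/Done -type f -print0 \
        | xargs -0 md5sum                      \
        | sort                                 \
        | uniq -w32 --all-repeated=separate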

19 December 2006

Manoj Srivastava: Mail Filtering With CRM114: Introduction

I have a fairly sophisticated Spam filtering mechanism set up, using MIMEDefang for SMTP-level rejection of Spam, grey-listing for mails not obviously ham or Spam, and both crm114 and spamassassin for discrimination, since they compensate for each other when one of them can't tell what it is looking at. People have often asked me to write up the details of my setup and the support infrastructure, and I have decided to oblige. What follows is a series of blog posts detailing how I set about redoing my crm114 setup, with enough detail that interested parties can tag along. I noticed that the new crm114 packages have a new facility called mailtrainer, which can be used to set up the initial css database files. I liked the fact that it can run over the training data several times, that it can keep back portions of the training data as a control group, and that you can tell it to keep going until it gets a certain number of mails discriminated correctly. This is cool, since I have a corpus of about 15K Spam and 16K ham messages, mostly stored thanks to my previous train-on-error practice (whenever crm114 classified a message incorrectly, I trained it, and stored all training emails). I train whenever crm114 and spamassassin disagree with each other. This would also be a good time to switch from crm114's mailfilter to the new mailreaver, the new 3rd generation mail filter script for crm114. It is easier to maintain, and since it flags mails for which it is unsure, and you ought to train all such mails anyway, it learns faster. Also, since the new packages come with brand new discrimination algorithms which are supposed to be as accurate but also faster, but may store data in incompatible ways, I figured that it might be time to redo my CRM mail filter from the ground up. The default now is "osb unique microgroom". This change requires me to empty out the css files. I also decided to change my learning scripts to not send commands via email to a 'testcrm' user; instead, I now train the filter using a command line mode. Apart from saving mails incorrectly classified into folders called ham and junk (Spam was already taken), I have a script that grabs mails which are classified differently by crm114 and spamassassin from my incoming mail spool and saves them into a grey.mbox file, which I can manually separate out into ham and junk. Then I have a processing script that takes the ham and junk folders and trains spamassassin, or crm114, or both, and stashes the training files away into my cached corpus for the future. In subsequent blog postings, I'll walk people through how I set up and initialized my filter, and provide examples of the scripts I used, along with various and sundry missteps and lessons learned and all.

Wouter Verhelst: Dunc-Tank, Debian, and crises in general

For those who've been hiding under a rock for the last few months: dunc-tank is the (controversial) project that was founded by the current DPL, Anthony 'aj' Towns. Controversial, because a lot of people feel that this goes against the spirit of Debian; and that by paying some Debian people and not paying others, we're creating a two-class system that we'd best avoid. And probably a bunch of other arguments, too, but that's not what this blog post is about. Me, I've been staying rather quiet in the whole debate. It's not that I don't have an opinion; rather, I haven't been voicing it, mostly because my thoughts on the matter are rather convoluted, and I'm not sure I can put them into words very well. But since the debate seems to be going in circles, and since I think that I might be able to help it out, I'll try anyway. If that doesn't bring the debate any further, well—can't blame me for trying. Do I think that paying people to do work on Debian is necessarily a bad idea? No, not at all. Quite to the contrary; by paying people to work full-time on Debian for a while, we will increase the amount of work that's being put into Debian, which can only be a good thing; and I can only applaud aj for trying to get this done, since it should be obvious to anyone involved that by proposing this, he only had Debian's best interests (i.e., improving our distribution) in mind. The obvious question, however, is who we'll be paying. After all, while I don't think it's necessarily bad to pay people, I do think that with money, you introduce the possibility of corruption; so at the very least, it should be strictly checked what exactly happens with the money that's being spent. Accordingly, the way in which you pay people—your whole set-up and procedures—should have a number of safeguards. Which brings me to my next point. Do I think that dunc-tank, as a project, is a good idea? No, not at all. Quite to the contrary; and I have a number of reasons for that. First, I don't think that deciding by committee who can get paid and who can't is the way to go forward. A committee will always be biased, no matter how diverse its membership. Even if you could find a committee that would not be biased—an impossibility by definition—there would still be people who will not agree with the committee's decision. Are they right not to agree? Not necessarily. Is it bad to create such controversy if it could be avoided? Sure as hell. Moreover, by creating a project which is populated by Debian people and which has the sole purpose of paying Debian people, it's quite possible to create the impression that all other ways to get paid to do Debian work would be unsound; that if a developer accepts money for Debian work without going through the committee, he's somehow doing something morally wrong. This would be very bad indeed; I know for a fact that there are cases of people paying Debian Developers to work a few days on a certain package so that a certain bug which impedes their business is fixed, or so that a certain version of a random package which this company is using is uploaded into the archive. I've done this myself in the past (although I can't disclose more, since the person who paid me requested to remain anonymous); I know of at least two other Debian Developers who've done so as well. So what's the alternative? Not paying people is one, but it would not necessarily be the best. One alternative, which the FreeBSD people have been using, is just "no organization at all".
People who want to get paid just put up a web page with the amount of money they would like to have, and it's up to the community to decide whether their request is valuable enough for it to actually be paid. To some extent, this is also what the Dunc-Tank committee is doing, except that as FreeBSD does it, there is no committee in between. The problem with this approach, however, is that if we all go ahead and ask for money at the same time, it'll not only make us look like a bunch of beggars; it's also very likely that nobody will actually be getting any money, because there's so many "good causes" to choose from that there might not be enough donations to fund them all. Another proposal was one made by Manoj Srivastava, before word got out that dunc-tank would be started (or about at that time). As it was made on the debian-private mailing list, I can't say much about that; suffice it to say that I feel his proposal was overly bureaucratic, and had problems that were similar in nature to the problems I feel exist with dunc-tank. So do I have any answers which might get us a way out of the current status quo? Nope, afraid not. But allow me to make this final observation: In early 2005, a bunch of people decided that Vancouver was a nice place, and travelled there. As it happened, they were all Debian people. What a coincidence. Since they were all there anyway, they decided to talk a bit about Debian, and as a result, came up with some plan for its future. This plan, too, was rather controversial. When I initially read the first few paragraphs, I suddenly fell upon this paragraph which seemed to say that "we suspect we won't be releasing with anything but i386, amd64, and powerpc for etch". Things suddenly turned kinda red from then on. And it wasn't the wine that I hadn't been drinking. I could imagine that what I felt back then is somewhat similar to what some other people felt when they first heard about the suggestion to pay people for Debian work; and this realization is the reason why I haven't been dismissing their arguments or thoughts, much like I've seen other people do. However, I cannot say that I feel the current crisis to have many parallels with the Vancouver one; this is mostly because in the current case, people on both sides of the argument seem to be building up walls around themselves, to help them ignore whatever the other side is saying. In contrast, when someone suggested something which had the potential effect of throwing most of the hard work I'd been putting into Debian during the last five years down the drain, I went to talk with them—after the heat had cooled down, anyway. Unfortunately for me, the result is exactly the same; for all my effort, I haven't been able to prevent m68k from being kicked out of the release; and to say I'm not happy about this would be an understatement. But at least this way the port has been given a fair chance; if I hadn't been talking to the people on the other side of the argument, I would still be talking about Steve Langasek as if he were the devil himself, and m68k would probably have been kicked out of Debian—along with a bunch of other architectures. At least I got that bit fixed. I guess what I'm saying is that people, rather than farting in each other's general direction, should be working together to find a compromise. Sure, that means making concessions, and agreeing to something you'd rather not agree to. But at least this way, the thing you have to agree to isn't as bad as what it could be.
And trust me, people will do the stuff you don't want them to do, with or without your agreement. Hell, they've already started. Whoa, long post. I'll stop preaching now.

18 December 2006

Manoj Srivastava: I am now an Ikiwiki user!

Well, this is first post. I have managed to migrate my blog over to Ikiwiki, including all the historical posts. The reason for the migration was that development on my older blogging mechanism, Blosxom, entered a hiatus, though recently it has been revived on SourceForge. I like the fact that Ikiwiki is based on a revision control system, and that I know the author pretty darned well :-). One of my primary requirements for the migration was that I be able to replicate all the functionality of my existing blog, and this included the look and feel of my blog (which I do happen to like, despite the wincing I see from some visitors to my pages). This meant replicating the page template and CSS from my blog. I immediately ran into problems: for example, my CSS markup for my blog was based on being able to mark up components of the date of the entry (day, day of week, month, etc.) and achieve fancy effects, and there was no easy way to use preexisting functionality of Ikiwiki to present that information to the page template. Thus was born the varioki plugin, which attempts to provide a means to add variables for use in ikiwiki templates, based on variables set by the user in the ikiwiki configuration file. This is fairly powerful, allowing for uses like:
    varioki => {
      'motto'    => '"Manoj\'s musings"',
      'toplvl'   => 'sub { return $page eq "index" }',
      'date'     => 'sub { return POSIX::strftime("%d", gmtime((stat(srcfile($pagesources{$page})))[9])); }',
      'arrayvar' => '[0, 1, 2, 3]',
      'hashvar'  => '{1, 1, 2, 2}'
    },

The next major stumbling block was archive browsing for older postings; Blosxom has a nice calendar plugin that uses a calendar interface to let the user navigate to older blog postings. Since I really liked the way this looks, I set about scratching this itch as well, and now ikiwiki has attained parity vis-a-vis calendar plugins with Blosxom. The calendar plugin, and the archive index pages, led me to start thinking about the physical layout of the blog entries on the file system. Since the tagging mechanism used in ikiwiki does not depend on the location in the file system (an improvement over my Blosxom setup), I could lay out the blog postings in a more logical fashion. I ended up taking Roland Mas' advice and arranging for the blog postings to be created in files like:
 blog/year/month/date/simple_title.mdwn

The archives contain annual and monthly indices, and the calendar front end provides links to recent postings and to recent monthly indices. So, with a few additions to the arch hook scripts, and perhaps a script to automatically create the directory structure for new posts and to automatically create annual and monthly indices as needed, I'll have a low-effort blogging work flow, and I might manage to blog more often than the two blog postings I have had all through the year so far.

4 November 2006

Anthony Towns: More DWN Bits

Following Joey’s lead, here’s some DWN-style comments on some of the stuff I’ve been involved in or heard of over the past week… A future for m68k has been planned on the release list, after being officially dropped as a release architecture in September. The conclusion of the discussion seems to be that we’ll move the existing m68k binaries from etch into a new “testing-m68k” suite that will be primarily managed by m68k porters Wouter Verhelst and Michael Schmitz, and aim to track the real testing as closely as can be managed. In addition, the m68k port will aim to make installable snapshots from this, with the aim of getting something as close as possible to the etch release on other architectures. A new trademark policy for Debian is finally in development, inspired by the Mozilla folks rightly pointing out that, contrary to what we recommend for Firefox, our own logos aren’t DFSG-free. Branden Robinson has started a wiki page to develop the policy. The current proposal is to retain two trademark policies – an open use policy for the swirl logo, that can be used by anyone to refer to Debian, with the logo released under an MIT-style copyright license, and left as an unregistered trademark; and an official use license for the bottle-and-swirl logo, with the logo being a registered trademark, but still licensed under a DFSG-free copyright license. The hope is that we can come up with at least one example, and hopefully more, of how to have an effective trademark without getting in the way of people who want to build on your work down the line. Keynote address at OpenFest. Though obviously too modest to blog about this himself, Branden Robinson is currently off in Bulgaria, headlining the fourth annual OpenFest, speaking on the topics of Debian Democracy and the Debian Package Management System. New Policy Team. After a few days of controversy following the withdrawal of the policy team delegation, a new policy team has formed consisting of Manoj Srivastava, Russ Allbery, Junichi Uekawa, Andreas Barth and Margarita Manterola. Point release of sarge, 3.1r4. A minor update to Debian stable was released on the 28th October, incorporating a number of previously released security updates. Updated sarge CD images have not been prepared at this time and may not be created until 3.1r5 is released, which is expected in another two months, or simultaneously with the etch release. Debian miniconf at linux.conf.au 2007. While it may technically not be supposed to be announced yet, there’s now a website for the Debian miniconf at linux.conf.au 2007, to be held in Sydney on January 15th and 16th (with the rest of the conference continuing until the 20th). This year derived distributions are being explicitly encouraged to participate, so competition is likely to be high, and it’s probably a good idea to get your talk ideas sorted out pretty quickly if you want them to be considered!
